Data Visualization Principles
Minimalism
One of Tufte’s ideas is the data-ink ratio: when creating a visualization, maximize the share of “ink” (physical or digital) used to represent the data, and minimize the ink spent on anything that does not help the reader understand it.
theme_classic() in ggplot()
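A minimal sketch of what theme_classic() changes, assuming the ggplot2 package is installed. The built-in iris data is only a stand-in here, not the data used in lecture.

```r
library(ggplot2)

# iris ships with R; it is a placeholder for the lecture's data
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

p                    # default theme_gray(): gray panel with white grid lines
p + theme_classic()  # axis lines only: no panel background, no grid lines
```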
Gestalt Principles
Gestalt psychology is a theory of perception which holds that humans tend to perceive objects as a whole structure rather than as the sum of their parts.
Gestalt Principles for Data Visualization
The Figure and Ground Principle describes the capacity to perceive the relationship between form and surrounding space to create meaning. A sense of wholeness or unity depends on how you perceive the relationship between an object and the area in which it is contained. The ‘figure’ is the focus element, while the ‘ground’ is the figure’s background.
theme_bw() in ggplot()
ggplot() themes
Moving into ANOVA
Before…
Now…
Goal of an ANOVA
Analysis of variance (ANOVA) compares the means of three or more groups to detect whether at least one group mean differs from the others.
How???
Visualizing Group Differences
We want visualizations that make it easy to compare the groups:
What can you say about the differences between the groups?
What can you say about the variability within the groups?
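One plot that speaks to both questions is a boxplot with the raw points on top. This is a sketch assuming ggplot2 is installed; the built-in PlantGrowth data (numeric `weight`, factor `group`) stands in for the lecture's data.

```r
library(ggplot2)

# PlantGrowth ships with R: `weight` is the response, `group` has three levels
group_plot <- ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot(outlier.shape = NA) +       # median and spread of each group
  geom_jitter(width = 0.1, alpha = 0.6) +  # raw observations, jittered sideways
  theme_classic()

group_plot
```

Differences between groups show up as boxes sitting at different heights; variability within groups shows up as the height of each box and the scatter of its points.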
Carrying out an ANOVA
Step 1: Compare your groups
Step 2: Find the overall mean
This ignores the groups and finds a single mean across all observations!
Step 3: Find the group means
Step 4: Calculate the sum of squares
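In the usual one-way ANOVA notation (the symbols are an assumption, not taken from these slides), with \(k\) groups, \(n_i\) observations in group \(i\), group means \(\bar{x}_i\), overall mean \(\bar{x}\), and observations \(x_{ij}\), the two sums of squares are:

\[
SS_{\text{groups}} = \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}\right)^2,
\qquad
SS_{\text{error}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2
\]

The first measures how far the group means sit from the overall mean; the second measures how far the observations sit from their own group mean.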
Step 5: Calculate the F-statistic
Can an F-statistic be negative?
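Steps 2 through 5 can be sketched by hand in base R. The built-in PlantGrowth data stands in for the lecture's data, and the variable names are illustrative.

```r
# PlantGrowth ships with R; it is a placeholder for the lecture's data
y <- PlantGrowth$weight   # numeric response
g <- PlantGrowth$group    # grouping factor

n <- length(y)            # total number of observations
k <- nlevels(g)           # number of groups

overall_mean <- mean(y)                # Step 2: ignores the groups
group_means  <- tapply(y, g, mean)     # Step 3: one mean per group
group_sizes  <- tapply(y, g, length)

# Step 4: sums of squares (both add up squared terms)
ss_groups <- sum(group_sizes * (group_means - overall_mean)^2)
ss_error  <- sum((y - ave(y, g))^2)    # ave() gives each observation its group mean

# Step 5: F = mean square between groups / mean square within groups
f_stat <- (ss_groups / (k - 1)) / (ss_error / (n - k))
f_stat
```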
Step 6: Find the p-value
F-distribution
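An \(F\)-distribution is closely related to the \(t\)-distribution (the square of a \(t\)-distributed statistic follows an \(F\)-distribution), and is also defined by degrees of freedom.
This distribution is defined by two different degrees of freedom: one for the numerator and one for the denominator of the \(F\)-statistic.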
Two degrees of freedom!
Changing the numerator degrees of freedom
Changing the denominator degrees of freedom
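A base R sketch of how the shape changes with each degrees of freedom. The pairs below are arbitrary illustrative choices, except (3, 90), which matches the ANOVA table in the theory-based section.

```r
x <- seq(0.01, 5, length.out = 200)

# One column of F densities per (numerator df, denominator df) pair
dens <- cbind(
  "df1 = 3, df2 = 90"  = df(x, df1 = 3,  df2 = 90),
  "df1 = 10, df2 = 90" = df(x, df1 = 10, df2 = 90),  # larger numerator df
  "df1 = 3, df2 = 5"   = df(x, df1 = 3,  df2 = 5)    # smaller denominator df
)

matplot(x, dens, type = "l", lty = 1, col = 1:3,
        xlab = "F", ylab = "density")
legend("topright", legend = colnames(dens), col = 1:3, lty = 1)
```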
Do you always use an F-distribution to get the p-value?
NO!
Conditions of an ANOVA
What do you think?
If the normality condition is violated, what type of method should we use?
Simulation-based Methods
Step 1: Calculating the Observed F-statistic
Response: min_eval (numeric)
Explanatory: age_cat (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  1.41
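Output like this comes from the infer package's specify() and calculate() pipeline. A sketch assuming infer is installed; since the evaluation data is not included here, the built-in PlantGrowth data (`weight` ~ `group`) stands in for `min_eval` ~ `age_cat`.

```r
library(infer)

# PlantGrowth is a placeholder for the lecture's evaluation data
obs_f <- PlantGrowth %>%
  specify(weight ~ group) %>%            # response ~ explanatory
  hypothesize(null = "independence") %>%
  calculate(stat = "F")                  # observed F-statistic

obs_f
```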
Step 2: Simulating what could have happened under \(H_0\)
How could we use cards to simulate what minimum evaluation score a professor would have gotten if their score were independent of their age?
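One shuffle of the cards can be sketched in base R. PlantGrowth again stands in for the evaluation data, and the seed is an arbitrary choice so the shuffle is reproducible.

```r
set.seed(1)  # arbitrary seed

shuffled <- PlantGrowth
# Like writing each score on a card, shuffling the deck, and dealing the
# cards back out to the original group labels
shuffled$weight <- sample(shuffled$weight)

# F-statistic for this one shuffled data set
perm_f <- anova(lm(weight ~ group, data = shuffled))[["F value"]][1]
perm_f
```

Repeating the shuffle many times and keeping each F-statistic builds the permutation distribution.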
Another Permutation Distribution
Why doesn’t the distribution have negative numbers?
Visualizing the p-value
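A sketch of the full simulation with infer, assuming the package is installed. PlantGrowth stands in for the evaluation data; the seed and the 1000 repetitions are arbitrary choices.

```r
library(infer)
set.seed(2024)  # arbitrary seed

# Observed F-statistic (PlantGrowth is a placeholder data set)
obs_f <- PlantGrowth %>%
  specify(weight ~ group) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F")

# Permutation distribution: shuffle the response, recompute F, repeat
null_dist <- PlantGrowth %>%
  specify(weight ~ group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# Shade the permuted statistics at least as large as the observed one
visualize(null_dist) +
  shade_p_value(obs_stat = obs_f, direction = "greater")

# Proportion of permuted F-statistics at least as large as the observed one
p_val <- get_p_value(null_dist, obs_stat = obs_f, direction = "greater")
p_val
```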
What would you conclude regarding the mean minimum evaluation score and different age groups of professors?
Theory-based Methods
Using aov()
# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 age_cat       3  1.24  0.414      1.41   0.244
2 Residuals    90 26.4   0.293     NA     NA
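A table with these columns is what broom's tidy() returns for an aov() fit. A sketch assuming the broom package is installed, with PlantGrowth standing in for the evaluation data.

```r
library(broom)

# PlantGrowth is a placeholder; the lecture's model would use min_eval ~ age_cat
fit <- aov(weight ~ group, data = PlantGrowth)

anova_table <- tidy(fit)  # one row for the grouping term, one for the residuals
anova_table
```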
How was the statistic calculated?
What distribution was used to calculate the p.value?
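Both values can be checked from the rounded numbers in the table above, using only base R. Small differences from the printed values are expected because the inputs are rounded.

```r
# Mean squares and degrees of freedom copied from the table above
meansq_groups    <- 0.414
meansq_residuals <- 0.293

# The statistic is the ratio of the two mean squares
f_stat <- meansq_groups / meansq_residuals
round(f_stat, 2)

# Upper-tail area of an F-distribution with 3 and 90 degrees of freedom
p_value <- pf(f_stat, df1 = 3, df2 = 90, lower.tail = FALSE)
p_value
```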
What would you conclude regarding the mean minimum evaluation score and different age groups of professors?
Did the two methods yield different results?
Next…